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AN  IMPROVED  RNS  DIVISION  ALGORITHM 

PETER  R  TURNER 

MATHEMATICS  DEPARTMENT,  US  NAVAL  ACADEMY,  ANNAPOLIS,  MD  21402 

Abstract.  This  paper  presents  a  division  algorithm  for  the  Residue  Number  System  which  is  a 
modification,  and  improvement  on,  the  recent  algorithm  of  Hitz  and  Kaltofen  (1994).  The  relative  cost 
of  the  divisions  is  substantially  reduced  rendering  the  RNS  division  feasible  for  computations  which 
are  not  division-intensive  such  as  the  solution  of  a  system  of  linear  equations  of  small  dimension.  The 
advantages  of  this  algorithm  over  the  original  work  of  Hitz  and  Kaltofen  lies  simply  in  using  a  ceiling 
function  in  place  of  the  floor  function.  This  leads  to  a  better  and  simpler  convergence  criterion  and 
test,  and,  more  importantly,  to  a  simple  scheme  for  accelerating  the  potentially  slow  early  iterations 
of  the  Newton-iteration-based  algorithm. 

1.  Introduction.  One  of  the  main  reasons  that  the  Residue  Number  System, 

RNS,  has  not  become  the  basis  of  a  widely-used  general  purpose  integer  or  floating¬ 
point  arithmetic  has  been  the  difficulty  in  performing  division  and  other  nonstandard 
RNS  operations.  Over  recent  years  much  effort  has  been  expended  in  the  attempt  to 
produce  a  good  RNS  division  algorithm.  These  fall  into  several  different  basic  classes. 

Most  of  the  earlier  algorithms  used  some  form  of  binary  conversion,  see  [10],  [2]  and  [3] 
for  example.  The  L-CRT  [5]  was  an  attempt  to  circumvent  the  need  for  full  divisions 
by  using  a  modified  version  of  the  Chinese  Remainder  Theorem  to  make  efficient 
scaling  a  feasible  alternative.  The  division  algorithm  of  [9]  is  based  on  a  binary  search 
using  RNS  comparisons.  Gamberger  [4]  was  one  of  the  first  to  use  an  extended  RNS 
basis  to  find  common  factors  and  scale  the  operands.  The  idea  of  an  extended  RNS 
basis  was  also  exploited  in  the  algorithm  of  Hitz  and  Kaltofen  [6].  Their  approach  was 
to  simulate  a  double-length  accumulator  to  perform  accurate  integer  division  using  a 
modified  Newton  iteration  to  obtain  a  ’’reciprocal”  of  the  divisor  relative  to  the  range 
of  the  base  RNS  being  used.  It  is  this  algorithm  which  provides  the  basis  for  the 
present  work. 

One  of  the  chief  potential  drawbacks  of  their  algorithm  is  that  the  desire  to  have  a 
uniform  starting  point  for  the  reciprocation  phase  necessarily  implies  very  slow  initial 
convergence  to  the  reciprocal.  For  an  RNS  system  with  a  small  dynamic  range  M 
this  0{logM)  first  stage  is  unlikely  to  be  excessively  expensive.  However  for  problems 
where  this  dynamic  range  is  large  this  first  phase  can  be  sufficiently  expensive  as  to 
render  the  system  inappropriate  for  computations  where  the  number  of  divisions  is 
large. 

One  area  in  which  RNS  arithmetic  has  proved  highly  beneficial  is  in  various  aspects 
of  signal  processing.  In  [7],  [8],  [11]  and  [12]  the  use  of  RNS  arithmetic  for  adaptive 
beamforming  problems  using  Gaussian  elimination  was  investigated.  A  divisionless 
form  of  the  algorithm  was  found  to  be  feasible  for  small  scale  problems  [7],  [8].  However 
the  dynamic  range  requirements  grow  too  fast  to  allow  this  approach  to  be  practical 
for  larger  problems.  To  understand  the  importance  of  speeding  up  RNS  divisions 
we  further  observe  that  the  number  of  divisions  needed  for  the  conventional  Gauss 
elimination  algorithm  is  0{N^)  (see  [1]  or  any  standard  Numerical  Analysis  text). 

For  integer  arithmetic  without  any  temporary  use  of  fractional  representations  the 
number  of  divisions  rises  to  O(A^).  Even  with  divisions  included  in  the  algorithm, 
the  wordlengths  and  time-penalty  associated  with  the  divisions  limit  its  usefulness 
[11],  [12].  An  efficient  RNS  division  algorithm  therefore  has  great  potential  in  this 
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and  other  applications. 

The  algorithm  presented  here  has  two  major  advantages  over  the  original  work  of 
[6].  First  by  using  the  ceiling  function  in  place  of  the  floor  function  in  the  reciprocation 
phase  of  the  algorithm,  a  final  comparison  (requiring  a  mixed-radix  conversion  to 
the  extended  RNS  basis)  is  avoided.  Second,  using  this  modified  algorithm  has  the 
additional  benefit  of  making  it  easy  to  recognize  when  an  iterate  is  distant  from  the 
required  reciprocal  and  therefore  to  accelerate  the  convergence.  In  [6]  the  need  for  this 
acceleration  is  recognized  and  is  handled  by  a  sequence  of  comparisons  against  powers 
of  2  to  find  a  better  starting  point  for  the  Newton  iteration.  This  is  implemented 
parallel  comparisons  and  summing  the  results.  Again  this  will  be  adequate  if  the 
dynamic  range  is  small  enough  that  the  necessary  hardware  is  available.  That  is  log  M 
comparisons  must  be  performed  simultaneously  -  each  requiring  a  full  base  extension 
via  a  mixed  radix  conversion.  For  all  but  the  smallest  of  dynamic  ranges  this  is 
likely  to  be  unrealistic.  This  is  the  motivation  for  accelerating  the  basic  algorithm. 
The  speed-up  achieved  is  by  an  asymptotic  factor  of  2  in  the  slowest  part  of  the 
algorithm.  Numerical  results  are  presented  to  verify  the  reduced  iteration  count  of 
the  new  algorithm  over  a  moderate  range.  The  expected  savings  would  grow  with  the 
dynamic  range.  The  new  algorithm  does  not  improve  the  order  oi  the  time-complexity 
but  has  the  combined  merits  of  simplifying  the  original  algorithm  and  speeding  it  up 
most  when  it  is  slowest. 

In  the  next  section  we  begin  with  a  brief  summary  of  the  notation  used  here  and  of 
the  Hitz  and  Kaltofen  algorithm  without  addressing  the  specifics  of  its  implementation 
in  RNS  arithmetic  which  is  ably  detailed  in  the  original  paper  [6].  In  section  3,  the 
convergence  analysis  of  Newton’s  method  for  reciprocation  is  revisited  with  reference  to 
the  particular  context  of  integer  arithmetic.  The  benefits  of  using  the  ceiling  function 
in  terms  of  this  convergence  analysis  and  for  implementation  in  terms  of  avoiding 
magnitude  comparison  testing  are  described.  Section  4  is  devoted  to  accelerating 
the  early  iterations  by  recognizing  the  situation  where  the  iterates  are  well-removed 
from  the  desired  reciprocal.  Numerical  experiments  to  show  the  savings  made  by  this 
algorithm  are  summarized  in  Section  5.  The  details  of  the  RNS  implementation  of  the 
revised  algorithm  are  essentially  unchanged  and  are  only  addressed  explicitly  when 
necessary  for  clarification  of  the  presentation. 

The  paper  concludes  with  a  summary  of  the  findings  together  with  some  discus¬ 
sion  of  area-complexity  and  the  potential  benefits  of  having  several  RNS  processors 
operating  in  parallel. 

2.  The  division  algorithm  of  Hitz  and  Kaltofen.  In  this  section  we  study 
the  division  algorithm  of  Hitz  and  Kaltofen  [6]  which  can  be  used  to  obtain  both 
the  integer  division  and  the  remainder  resulting  from  division  of  two  integers  each 
represented  in  an  extended  RNS  which  plays  the  role  similar  to  that  of  a  double  length 
accumulator  in  regular  binary  arithmetic.  No  CRT  conversions  or  their  equivalent  are 
required  by  this  algorithm.  It  does  however  use  several  RNS  base-extension  operations. 
These  however  can  be  performed  entirely  within  the  RNS  processors  -  provided  enough 
such  processors  are  available.  The  section  begins  with  a  brief  summary  of  our  notation 
and  of  the  Hitz  and  Kaltofen  algorithm  which  has  at  its  heart  a  pseudo- reciprocation 
step. 

The  basic  idea  here  is  that  a  double  length  RNS  representation  is  used  in  the 
following  manner.  Let  pi,  ^2,  Pl^  Pl+i?  •••?  P2L  be  a  set  of  prime  numbers.  (They 
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Qnl;,'  i:^eed  to  be  relatively  prime  for  this  purpose  but  we  shall  need  Gaussian  primes 
for  our  algorithm  in  order  to  perform  complex  RNS  arithmetic  efficiently.)  Assume 
they  are  ordered  so  that  pi  <  PL+i  for  oach  i  =  1,2,  In  [6]  slightly  stronger 

order  assumptions  are  made  but  the  additional  assumptions  play  no  role;  indeed  even 
this  assumption  can  be  weakened  as  we  will  see  in  discussing  the  use  of  this  algorithm 
later. 

Let 

L  _  L 

M  =  J[Pi,  M  =  Y[pL+i 
»=i  «=i 

so  that  M  represents  the  dynamic  range  of  the  base  RNS  &J\.d  MM  represents  the  range 
of  the  extended  RNS.  Some  important  (though  immediately  apparent)  observations  are 
made  in  [6]:  _ 

M,  M  are  relatively  prime  and  M  <  M. 

It  follows  that  mod  M  exists.  Also  the  first  N  moduli  in  the  extended  repr^ 
sentation  of  an  extended  RNS  number  represent  the  remainder  of  that  number  mod  M 

It  also  follows  that  multiplication  of  two  numbers  represented  in  the  base  RNS 
(that  is,  two  numbers  smaller  than  M)  cannot  overflow  the  extended  RNS  range. 

The  division  algorithm  then  consists  of  two  stages  in  order  to  obtain  the  integer 
quotient  [A/FJ  and  remainder  X  modT  where  X,Y  are  positive  integers  in  the  base 
RNS  range.  In  the  first,  the  ’’reciprocal”  YMjY]  of  Y  is  obtained  using  a  modifica¬ 
tion  of  Newton’s  method.  This  is  then  used  to  generated  the  desired  quotient  and 
remainder. 

The  reciprocation  algorithm  is  described  in  [6]  as  follows. 

Algorithm  RECIP 
Input:  Y 

Output:  [M/yj 

begin 
Zi  ^  0 
Z2  ^  2 

while  Z\  7^  Z2  do 
Z\  ^  Z2 

Z2  ^  [Zj  *  (2M  -  y  *  Zi)  /MJ 
if  M  -  y  *  Z2  <  y 

then  return  Z2  else  return  Z2  +  1 
end. 

The  need  for,  and  correctness  of,  the  correction  step  is  established  in  [6].  The 
iteration  in  the  loop  is  just  Newton’s  method  in  an  ’’integerized”  form.  The  division 
by  M,  equality  testing  and  comparison  are  all  achievable  using  the  extended  RNS  basis 
in  ways  which  are  also  detailed  in  [6].  We  shall  discuss  later  only  those  details  which 
are  needed.  Once  this  M-reciprocation  has  been  completed  the  rest  of  the  division  is 
straightforward.  This  algorithm  is  also  given  in  [6]: 

Algorithm  DIVREM 
Input:  A,y 

Output:  [X/yj  and  X  mod y 

begin 


3 


NAWCADWAR-95002-4.5 


Q  ^  [X*RECIP  (F)/MJ 
R^X 

if  R  <Y  then  return  R 
else  return  Q  +  R  —  Y 
end. 

The  comparison  and  division  by  M  are  performed  as  in  the  RECIP  algorithm  and 
again  the  reader  is  referred  to  [6]  for  the  details  of  their  RNS  implementations.  One 
aspect  of  the  implementation  is  worth  commenting  on  however. 

The  comparison  and  division  by  M  both  require  base  extensions  between  the  base 
RNS  and  the  extended  RNS  and  between  the  extension  RNS  based  on  P2L 

and  the  full  extended  RNS  system.  These  operations  require  mixed  conversion  to  the 
corresponding  mixed  radix  systems,  MRS.  This  step  is  expensive  either  in  time  or, 
as  discussed  in  [6],  in  hardware  in  order  to  reduce  the  time-penalty.  The  hardware 
configuration  suggested  is  an  array  of  L{L  +  l)/2  modular  multipliers  with  L  separate 
binary  trees  of  modular  adders  and  a  final  MRS  carry-lookahead  adder.  With  rela¬ 
tively  small  RNS  bases  this  degree  of  parallelism  in  the  processors  may  be  realizable. 
However  the  acceleration  phase  in  [6]  also  requires  log  M  of  these  comparison  units  - 
each  needing  the  base  extension  architecture  just  described. 

For  the  adaptive  beamforming  solution  referred  to  above,  L  is  typically  around 
16  and  logM  around  120  even  for  moderate-size  problems.  The  overall  requirement 
would  therefore  be  for  some  15,000  modular  multipliers.  This  level  of  parallelism 
can  probably  be  better  used  for  purposes  other  than  just  speeding  up  division.  In 
the  case  of  Gaussian  elimination,  there  are  often  several  divisions  using  the  same 
divisor  and  so  a  conventional  MRS  conversion  could  perhaps  be  elficiently  pipelined 
for  this  purpose.  Avoiding  the  magnitude  comparisons  that  are  central  to  the  Hitz 
and  Kaltofen  algorithm  would  also  be  a  significant  benefit.  In  the  remaining  sections 
of  this  paper  we  establish  an  alternative  form  of  the  algorithm  which  avoids  these 
comparisons  and  achieves  a  lesser  speed-up  than  the  log  M  comparisons  but  much 
more  cheaply.  Indeed  it  uses  no  additional  hardware  to  that  of  the  base  algorithm. 

3.  Improved  reciprocation  using  the  ceiling.  It  is  apparent  that  the  critical 
part  of  this  division  algorithm  is  RECIP  and  that  any  savings  which  can  be  achieved 
there  are  worthwhile.  In  this  subsection  we  consider  two  sources  of  improvement.  The 
first  is  the  elimination  of  the  correction  step  by  using  the  ceiling  rather  than  the  floor 
function  in  the  iteration.  The  convergence  analysis  must  be  appropriately  modified  - 
or  simplified  -  to  take  account  of  this.  This  removes  the  need  for  the  final  comparison 
which  is  implemented  in  extended  RNS  by  using  two  base  extension  operations. 

In  [6],  the  number  of  steps  of  the  Newton  iteration  is  also  discussed.  For  conve¬ 
nience  of  the  analysis  this  is  separated  into  two  parts  which  can  be  thought  of  as  a  first 
stage  of  order  0(logM)  which  achieves  an  iterate  with  a  relative  error  of  less  than 
75%  and  a  second  O  (log  log  M)  which  is  the  usual  quadratic  convergence  behavior  of 
Newton’s  method.  They  go  on  to  discuss  a  hardware-intensive  acceleration  of  the  first 
stage  as  described  above.  In  our  setting  this  acceleration  is  unlikely  to  be  practica¬ 
ble  and  so  the  first  stage  constitutes  a  potentially  prohibitive  number  of  iterations  to 
achieve  a  suitably  good  estimate  that  the  quadratic  convergence  takes  over. 

We  begin  with  a  brief  review  of  the  convergence  theory  of  Newton’s  method  for 
the  modular  reciprocation  operation.  The  most  important  aspects  are  that  the  (real) 
Newton  iteration  generates  a  strictly  mcrea^mg  sequence  of  iterates  -  after  a  possible 
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deci-ease  on  tie  first  iteration  if  the  initial  estimate  is  too  large.  In  our  setting  this 
will  imply  a  strictly  increasing  integer  sequence  until  it  converges.  For  fuU  details  of 
the  analysis  of  Newton’s  method  see  a  standard  Numerical  Analysis  text  such  as  [1]. 

The  standard  (real  arithmetic)  Newton  iteration  for  finding  M/T  where  M  >Y 
is  just  the  application  of  Newton’s  method  to  the  solution  of  the  equation 

With  an  appropriately  chosen  starting  point  this  yields  the  familiar  iteration 


If  <  MjY  then  it  is  easy  to  see  that  a;„+i  >  a:„.  Also  from  the  conventional  error 
analysis  of  this  iteration,  we  have 

M  Y  (  mV 

so  that  Xn+i  <  M/Y.  It  follows  that,  at  least  for  n  >  1,  the  sequence  (x„)  is  strictly 
increasing.  It  will  converge,  eventually  quadraticaUy,  provided  only  that  xi  >  0 
which  will  be  ensured  if  xq  <  2MIY.  Usually  somewhat  more  restrictive  bounds 
resulting  from  a  first  order  Taylor  expansion  are  put  on  the  iterates,  namely  that 
xo  e  (If’If)-  These  bounds  also  have  the  advantage  of  preventing  cycles  in  the 
"integerized”  iteration  we  shall  be  considering. 

The  Hitz  and  Kaltofen  algorithm  [6]  generates  an  integer  sequence  by  simply 
taking  the  floor  of  each  iterate.  This  will  also  generate  an  increasing  sequence  which 
’’converges”  to  either  [M/Y\  or  [M/TJ  -  1.  In  this  context  ’’converges”  means  that 
successive  iterates  are  equal.  The  simple  modification  suggested  here  is  that  we  replace 
this  floor  with  the  corresponding  ceiling  so  that  the  iterates  are  given  by 

(2)  = 

which  can  be  implemented  easily  irrespective  of  which  of  floor  and  ceiling  is  regarded 
as  a  primitive  function. 

Clearly  at  each  stage  this  iterate  will  be  greater  than  its  counterpart  in  the  original 
(or  indeed  in  the  continuous)  algorithm  and  is  again  a  strictly  increasing  sequence  until 
’’convergence  ”  is  achieved.  The  acceleration  due  to  the  use  of  the  ceiling  here  is  likely 
to  be  slight  and  should  not  be  thought  of  as  a  particularly  significant  aspect  of  this 
modified  algorithm. 

As  with  any  discrete  or  integerized  algorithm,  ’’convergence”  has  a  special  inter¬ 
pretation  in  this  context.  The  strictly  increasing  nature  of  the  sequence  guarantees 
that  eventually  an  iterate  is  obtained  such  that 


at  which  point  either  Zn+\  =  [^1  or  .Z„+i  =  [^1  -  1  =  [^J.  That  these  are  the  only 
possibilities  follows  from  the  facts  that  the  error  is  being  reduced  at  each  iteration  and 
a;n+i  <  M/y . 
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Thus  eventually  the  iterates  either  remain  static  at  or  oscillate  between 
and  ,  Also  two  successive  iterates  cannot  be  equal  unless  convergence  to  [^]  has 
been  achieved.  Similarly,  the  oscillation  occurs  if,  and  only  if,  —  1.  Thus 

we  have  a  very  simple  pair  of  convergence  criteria  to  test.  The  iteration  is  halted  if 

either  Zn+i  =  or  -  1. 

Thus  we  have  established 

Theorem  1.  The  Modified  RECIP  algorithm  below  returns  the  correct  value  of 

m 

Algorithm  Modified  RECIP 
Input:  Y 

Output:  [M/y] 

begin 
Zi  ^  0 
Z2  ^2 

while  Zi  /  Z2  and  Zi  —  1  ^  Z2  do 

Zi  ^  Z2 

Z2  ^  [Zi  *  (2M  -  y  *  Zi) /M] 

return  Z\ 
end. 


The  convergence  test  and  magnitude  comparison  required  by  the  original  RECIP 
algorithm  are  thus  replaced  by  a  pair  of  equality  tests.  (We  observe  that,  of  course, 
the  addition  or  subtraction  of  1  in  any  RNS  system  is  just  the  corresponding  addition 
or  subtraction  of  1  in  each  residue.)  An  analysis  of  the  frequency  with  which  the 
oscillation  can  occur  shows  that  if  we  regard  Y  as  uniformly  distributed  in  the  interval 
(0,  M]  then  approximately  75%  of  the  time  the  iteration  will  converge  to  without 
any  oscillation. 

Note  that  the  modified  algorithm  above  is  not  the  final  revised  algorithm  that  is 
the  main  objective  of  this  paper.  However  it  does  represent  the  basis  of  the  use  of 
the  ceiling  function  in  the  integerized  Newton  iteration  which  is  at  the  heart  of  the 
improvements. 

We  conclude  this  section  with  the  observation  that  the  algorithm  DIVREM  in 
Section  2  also  finishes  with  a  magnitude  comparison  on  which  is  based  a  possible 
correction  step.  In  both  the  original  algorithm  and  the  version  being  developed  here, 
this  is  avoidable  if  only  the  integer  part  of  the  result  is  required  and  an  absolute 
less  than  1  is  acceptable.  In  the  original  algorithm  this  is  achieved  by  simply  using 
I”.]  in  place  of  [.J  in  DIVREM  and  eliminating  the  test.  The  corresponding  stage 
of  the  revised  algorithm  would  use  RECIP(y)  =  and  then  to  set  Q  ^  [X  ^ 
RECIP  (y)  /MJ.  That  this  yields  an  absolute  error  less  than  1  follows  since 


so  that 


X  X  X  ^  X,  \M|Y^ 
^  M - 


> 

X*  \M/Y] 

> 

X 

lY  J 

M 

Y. 

and,  except  when  X/Y  is  an  exact  integer,  +  ij  =  +  1  =  ^j.  (If  the  true 

result  is  an  integer,  then  this  quotient  will  be  obtained  exactly.)  If  a  true,  positive, 
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lOTiainder  is  also  ret^uiYed  such  as  in  [6],  then  some  correction  step  similar  to  that  in 
DIVR.EM  IS.  necessary. 

4.  Accelerating  the  early  iterations.  A  second  consequence  of  the  use  of  the 
ceiling  function  in  the  iteration  is  that  it  becomes  easy  to  recognize  when  the  iterates 
are  well-removed  from  the  desired  reciprocal  -  and  to  make  appropriate  adjustments 
for  this.  The  effect  is  that  the  number  of  iterations  used  in  this  first  phase  is  reduced 
by  an  asymptotic  factor  of  2. 

The  acceleration  is  based  on  the  fact  that  if  x„  <  MjY  in  (1)  then  i„+i  ~  2a;„ 
so  that  the  Newton  iterates  approximately  double  with  each  early  iteration  from  a 
starting  value  which  is  too  small.  It  also  follows  from  (1)  that  x„+i  <  2x„  and 
therefore  if  the  floor  function  is  used  as  in  the  original  [6]  algorithm,  we  can  never 
have  ZnJri  =  2^n-  Using  the  ceiling  function  allows  the  possibility  that  Z„+i  =  2Zn. 
This  condition  is  also  an  easy  one  to  check  since  multiplication  by  2  is  a  simple  modular 
operation.  Let  us  consider  when  this  ”iterate-doubling”  can  occur. 

From  (2)  we  see  that  =  2Z„  if 


which  is  to  say  that 


or  Zn  <  ^/MJY.  This  suggests  that  we  simply  test  whether  Zn+i  =  “^Zn  and  if  this 
returns  ’’True”  then  replace  Z„+i  with  Z„+i  =  -  or  perhaps  some  other  well-chosen 

alternative  value.  We  also  observe  that  the  quantity  has  already  been  computed 
and  so  no  additional  computation  is  required  beyond  the  equality  test. 

This  will  clearly  accelerate  the  early  iterations  especially  in  the  situation  where 
M/Y  >  1.  If,  however,  M/Y  is  very  large  then  several  iterations  may  still  be  required 
before  iterates  larger  than  y/MjY  are  reached.  There  may  be  significant  gains  to  be 
made  by  including  a  further  scaling  into  the  revised  Zn+i  when  doubling  has  taken 
place.  The  rest  of  this  section  is  devoted  to  a  brief  analysis  of  this. 

Temporarily  write  c  =  MjY .  Our  underlying  Newton  iteration  is  then 


which  is  the  normal  iteration  for  computing  the  reciprocal  of  1/c. 

The  modified  algorithm  RECIP  of  Section  3  generates  iterates  2,  4,  8, ...  ,  Zn  =  2" 
until  Zn  >  y/c  whereas  the  further  modification  outlined  above  would  generate  the 
iterates  2,  4,  16,  ...,  Zn  =  2^"“‘,  again  until  Zn  >  y/c.  The  first  of  these  sequences 
requires  \  log  cj  iterations  to  satisfy  this  condition  after  which  iterate  doubling  will 
not  happen  and  eventually  the  quadratic  convergence  will  take  over.  The  second 
sequence  takes  [log  (log  c)]  iterations  to  reach  this  point.  Thus  the  slowest  part  of 
the  algorithm  is  accelerated  from  O  (log  c)  to  0  (log(log  c))  iterations  at  the  cost  of  an 
additional  modular  operation  and  an  equality  test  per  iteration.  This  however  does  not 
bring  the  overall  operational  complexity  of  the  algorithm  down  to  O  (log(log  c))since 
the  number  of  iterations  required  to  attain  a  value  within  a  fixed  relative  error  will 
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still  be  0  (logc)just  as  in  [6].  What  is  achieved  is  an  improvement  by  a  factor  of  about 
2  in  this  part  of  the  algorithm. 

The  fixed  relative  precision  used  in  [6]  was  3/4.  That  is,  they  establish  that 
0  (log  c)iterations  are  required  to  generate  an  iterate  Zn  >  cf  4  after  which  0  (log(log  c))further 
iterations  are  required  to  achieve  convergence.  Clearly  it  is  desirable  to  accomplish 
the  first  stage  with  as  few  iterations  as  possible. 

In  the  case  of  iterate-doubling,  it  is  clearly  possible  to  set  Zn^i  =  aZ^  where  a  is 
a  constant  to  be  chosen.  The  idea  is  to  get  beyond  the  threshold  of  ^/c  more  rapidly 
without  generating  an  iterate  which  is  so  large  as  to  set  up  oscillations  by  in 

turn  generating  very  small  iterates.  How  should  we  choose  a? 

The  elementary  function  iteration  analysis  ([1],  Section  3.2,  for  example)  of  New¬ 
ton’s  method  for  reciprocals  implies  that  quadratic  convergence  in  the  present  situation 
is  guaranteed  for  any  Zq  G  (c/2,3c/2).  This  in  turn  suggests  that  choosing  a  =  3/2 
is  feasible.  This  choice  wiU  certainly  accelerate  the  initial  growth  of  the  iterates  when 
Zq  <C  c.  It  turns  out  to  be  the  optimal  constant  choice  for  our  integer  arithmetic 
setting. 

First,  if  Zn  <  3c/2  then  it  follows  that  >  3c/4  and  the  subsequent  relative 
errors  indeed  decrease  quadraticaUy.  To  see  this  consider  again  the  continuous  form 
of  the  algorithm  and  write 


Xn  —  C  •  Gn 


Then  the  recurrence  for  the  sequence  (on)  is  given  by 


and  if  Gn  =  1  —  2  ^  then 


a„+i  =  (l  -  2-^)  (2  -  [l  -  2"*])  =  1  -  2-2^ 


Next,  a  larger  value  of  a  could  be  chosen  to  accelerate  this  early 
further  provided  only  that  Zn+i  >  c/2  whenever  Zn^i  <  \/c  and  Zn  > 
a  must  be  such  that  Xn  =  ac,  or  Cn  =  a,  yields  Gn+i  >  1/2.  This 


2a  +  1/2  <  0  which  yields  a  G  1  ~  v^l/2, 1  +  y/1/2^ .  Since  we  are 


growth  even 
c.  Therefore 
implies  that 
interested  in 


a  >  1,  this  suggests  that  the  largest  acceptable  value  is  a  =  1  +  7172  ~  1.7071.  A 
more  intricate  analysis  shows  that  the  requirement  that  a^+i  >  1/2  is  more  restrictive 
than  necessary  and  that  even  larger  values  of  a  <  2  would  also  be  feasible. 

For  example  a  =  15/8  would  lead  to  faster  growth  of  small  iterates.  Furthermore, 
if  Xn  =  15c/8,  so  that  Gn  =  15/8,  then  Gn^i  =  15/64  ~  0.23,  an+2  =  1695/4096  ~ 
0.41  and  the  convergence  behaves  essentially  quadraticaUy  thereafter.  In  order  that 
cycles  cannot  be  set  up  by  this  process  it  is  necessary  that  ^  >  ^/c  which  in  turn 
implies  that  c  >  (64/15)^  c:::  18.2.  Thus  the  factor  a  =  15/8  should  not  be  used  with 
^n-i  <  4  <  Vc. 

There  is  another  good  reason  for  avoiding  a  factor  such  as  this  too  soon  in  RNS, 
or  any  integer,  arithmetic.  The  result  of  the  multiplication  should  be  an  integer  and 
should  be  computable  using  modular  operations.  With  our  choice  of  Zi  =  2,  the 
factor  a  =  3/2  would  always  result  in  even  integers  during  this  first  growth  phase  and 
could  therefore  be  repeated.  It  could  also  be  a  precomputed  multiplier  for  modular 
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operations  where  we  know  there  is  no  risk  of  overflowing  the  dynamic  range  -  which 
is  tke  present  situation,  ot  =  3/2  is  the  largest  such  constant. 

The  sequence  of  initial  iterates  used  until  Zn  >  \/c  becomes  2,  6,  54,  4374,  ... 
which  with  Zq  =  2  is  given  by  Z„  =  2  •  3^”“^.  The  maximum  number  of  iterations  is 
reduced  to  flog([l  +  log  c] /2  log 3)]  which  is,  approximately,  equivalent  to  replacing 
■/c  with  in  the  original  inequality.  With  this  growth  factor  for  the  early  iterations 
our  revised  algorithm  becomes: 

Algorithm  CRECIP  (The  accelerated  Ceiling  version  of  RECIP) 

Input:  Y 

Output:  [M/y] 

begin 
Zi  ^  0 
Z2  ^2 

while  Z\  ^  Zi  and  Zi  —  1  ^  ^2  do 
2^^  ^ _ ^2 

Z2  ^  [Zi*(2M-y 


if  Z2  ~  2Zj  then  Z2  — 
return  Z\ 
end. 

We  note  that  in  the  second  form  of  the  Newton  iteration  both  2Zi  and  Z^  have 
been  computed  before  the  test  and  acceleration  step  are  performed. 

5.  Numerical  experiments.  The  analysis  above  suggests  that  either  of  the  ac¬ 
celeration  approaches  described  in  the  previous  section  is  likely  to  represent  a  signifi¬ 
cant  saving  relative  to  the  original  modified  algorithm  of  Section  3.  Results  of  tests  are 
included  in  this  section  which  illustrate  clearly  that  substantial  savings  in  the  number 
of  iterations  required  are  indeed  achieved  by  the  algorithm  CRECIP.  The  experiments 
were  performed  using  conventional  integer  arithmetic  with  a,  32-bit  two’s  complement 
representation.  The  number  of  iterations  required  for  convergence  was  recorded  for 
each  of  the  algorithms:  the  modified  algorithm  of  Section  3,  the  accelerated  version 
using  a  =  1  as  described  and  finally  CRECIP  with  a  =  3/2. 

With  either  of  the  accelerated  algorithms,  the  progress  will  stiU  be  slow  if  the  first 
iterate  greater  than  ^/c  is  only  slightly  greater  than  this  threshold.  Whether  this  is 
more  likely  to  happen  with  a  =  1  or  with  o  =  3/2  will  vary  with  c.  For  any  particular 
value,  the  fastest  convergence  will  be  achieved  if  an  iterate  is  only  slightly  smaller 
than  -y/c  so  that  the  next  one  is  close  to  2y/c.  For  this  reason  there  are  cases  in  the 
results  where  a  =  1  appears  to  be  better  than  a  =  3/2.  However  the  general  result  is 
clearly  that  the  latter  choice  is  to  be  preferred.  As  the  dynamic  range  of  the  integers 
increases  the  advantage  to  be  gained  from  this  latter  choice  would  grow. 

Specifically,  the  numbers  of  iterations  were  obtained  for  c  =  2, 3, ...,  10  and  for  ran¬ 
domly  generated  values  in  each  of  the  intervals  •  10^,  (n  -f  1)  •  10*^j  for  n  =  2, 3, ...,  10 
and  fc  =  1, 2, ...,  8.  Partial  results  for  one  particular  case  form  Table  1  and  a  summary 
of  the  performance  over  several  runs  of  this  experiment  is  presented  as  Table  2. 

The  pattern  observed  even  in  these  partial  results  is  typical  of  all  the  results 
generated.  There  are  cases  where  the  a  =  1  form  of  the  algorithm  performs  somewhat 
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better  than  CRECIP  (a  =  3/2)  but  the  proportional  improvement  when  the  latteris 
best  is  much  greater. 

By  using  random  samples  such  as  those  described  above  rather  than  the  full  range 
of  integers  in  the  dynamic  range,  we  avoid  biasing  the  results  artificially  in  favor  of  the 
CRECIP  algorithm  due  to  a  preponderance  of  large  values  of  c.  Also  by  using  ’’powers 
of  10”  as  the  basis  for  the  random  sample  any  effect  of  a  pattern  in  the  distribution  of 
the  square  roots  relative  to  the  accelerated  iterates  powers  of  2  or  powers  of  3  is  likely 
to  be  minimized. 

TABLE  1  Numbers  of  iterations  required  for  various  forms  of  the  integer 
reciprocation  algorithm _ 


c 

Original 

a  =  1 

a  =  3/2 

10 

5 

5 

5 

54 

9 

8 

4 

94 

10 

9 

7 

105 

10 

9 

7 

286 

12 

7 

9 

4949 

17 

13 

7 

8830 

18 

14 

9 

53996 

21 

17 

13 

90028 

21 

10 

13 

723813 

25 

14 

17 

29040699 

30 

19 

8 

34088349 

31 

20 

10 

914885094 

36 

25 

16 

1051248160 

36 

25 

16 

In  Table  2,  we  present  the  total  numbers  of  iterates  for  each  of  the  algorithms 
for  complete  runs  of  the  randomized  test  described  above.  There  were  five  runs  in 
which  all  three  algorithms  were  tested  and  a  further  three  using  just  the  two  forms  of 
the  accelerated  algorithm.  AU  eight  of  these  are  taken  into  account  in  generating  the 
approximate  averages  which  form  the  final  line  of  the  table. 


In  this  section  we  have  seen  that  the  use  of  the  ceiling  form  of  the  algorithm  permits 
significant  savings  due  to  the  possibility  of  generating  exact  iterate-doubling  when 
current  iterates  are  well-removed  from  the  desired  reciprocal.  This  iterate-doubling 
also  has  a  simple  interpretation  in  the  inference  that  such  iterates  are  smaller  than  y/c. 
Further  analysis  of  the  errors  shows  that  using  a  >  1  is  advantageous  and  because  of 
the  requirement  for  exact  integer  arithmetic  the  choice  a  =  3/2  is  optimal. 

6.  Conclusions.  In  this  paper  we  have  presented  a  modified  version  of  the  RNS- 
integer  division  algorithm  introduced  by  Hitz  and  Kaltofen  in  [6].  This  modified 
version  has  several  advantages  over  the  original  unless  a  very  large  amount  of  hardware 
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js  available  kiiplement  the  parallel  acceleration  technique  described  in  [6]. 

The  basis  for  the  improvements  is  the  use  of  the  ceiling  function  in  place  of  the 
Roor  in  the  original  algorithm.  The  advantages  gained  are,  firstly,  that  the  additional 
magnitude  comparison  required  by  [6]  is  avoided  since  the  underlying  reciprocation  will 
always  yield  the  ceiling  of  the  desired  quotient.  (Of  course,  subtraction  of  unity  would 
deliver  the  floor  if  this  is  required  for  a  speciflc  purpose.)  The  magnitude  comparisons 
saved  involve  full  MRS  conversions  and  base  extensions  between  the  base-RNS  and 
the  extension-RNS  systems.  These  are  potentially  expensive  operations. 

The  second  major  gain  is  that  the  modified  algorithm  CRECIP  lends  itself  to 
substantial  acceleration  of  its  underlying  Newton  iteration  due  to  the  iterate-doubling 
inherent  in  the  ceiling  form  of  the  integer  algorithm.  A  careful  use  of  the  error  analysis 
yields  the  optimal  parameter  a  for  this  acceleration  scheme.  This  value  is  incorporated 
into  the  algorithm  described  here.  (It  is  conceivable  that  if  the  value  of  a  is  allowed 
to  vary  that  an  even  better  scheme  can  be  devised.) 

The  premise  for  this  modification  is  that  in  a  context  such  as  adaptive  beamform¬ 
ing  the  dynamic  range  requirement  is  such  that  the  number  of  modular  arithmetic 
units  required  for  the  hardware-acceleration  suggested  in  [6]  is  unlikely  to  be  avail¬ 
able.  The  modified  algorithm  together  with  suitable  pipelining  of  the  linear  algebra 
operations  of  Gauss  elimination  (or  Gauss- Jordan)  could  significantly  reduce  the  time- 
penalty  associated  with  RNS-division.  This,  in  turn,  may  be  sufficient  to  render  the 
RNS-solution  of  ABF  problems  using  such  elimination  algorithms  practicable  for  larger 
systems  than  hitherto. 
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